class: title-slide, right, top background-image: url(data:image/png;base64,#img/hex_tidyr.png), url(img/axsome_logo.png) background-position: 93% 63%, 50% 50% background-size: 10%, 40%
.right-column[ # Module 9: Data Tidying ### **Graham Eglit**<br> Axsome Therapeutics<br> Fall 2024 ] --- class: inverse, center, middle # Tidy Data Overview! ---- <svg viewBox="0 0 581 512" style="position:relative;display:inline-block;top:.1em;fill:white;height:3em;" xmlns="http://www.w3.org/2000/svg"> <path d="M581 226.6C581 119.1 450.9 32 290.5 32S0 119.1 0 226.6C0 322.4 103.3 402 239.4 418.1V480h99.1v-61.5c24.3-2.7 47.6-7.4 69.4-13.9L448 480h112l-67.4-113.7c54.5-35.4 88.4-84.9 88.4-139.7zm-466.8 14.5c0-73.5 98.9-133 220.8-133s211.9 40.7 211.9 133c0 50.1-26.5 85-70.3 106.4-2.4-1.6-4.7-2.9-6.4-3.7-10.2-5.2-27.8-10.5-27.8-10.5s86.6-6.4 86.6-92.7-90.6-87.9-90.6-87.9h-199V361c-74.1-21.5-125.2-67.1-125.2-119.9zm225.1 38.3v-55.6c57.8 0 87.8-6.8 87.8 27.3 0 36.5-38.2 28.3-87.8 28.3zm-.9 72.5H365c10.8 0 18.9 11.7 24 19.2-16.1 1.9-33 2.8-50.6 2.9v-22.1z"></path></svg> --- .center[ # Tidy Data Overview ] .pull-left[ - `tidyr` is designed to create tidy data - Tidy data is data where: - Every column is variable. - Every row is an observation. - Every cell is a single value. - Tidy data is a standard way of storing data that is used wherever possible throughout the tidyverse. - Tidy data leads to less time battling with tidyverse functions, more time analyzing data ] .pull-right[ .center[ <img src = "data:image/png;base64,#https://media.giphy.com/media/uWzDsAsRm2X9qULHLs/giphy.gif" /> .caption[ Via [Giphy](https://media.giphy.com/media/uWzDsAsRm2X9qULHLs/giphy.gif) ] ] ] --- .center[ # Tidy Data Overview ] .center[ <img src = "data:image/png;base64,#img/mod9/tidy_data.png" /> .caption[ From [R for Data Science](https://r4ds.had.co.nz/tidy-data.html) ] ] <br> .left[ [You can read more about Tidy data in Hadley Wickham's tidy data article](https://www.jstatsoft.org/article/download/v059i10/772) ] --- .center[ # Tidy and Non-Tidy Data Examples ] .panelset[ .panel[.panel-name[Example 1] |country | year| cases| population| |:-----------|----:|------:|----------:| |Afghanistan | 1999| 745| 19987071| |Afghanistan | 2000| 2666| 20595360| |Brazil | 1999| 37737| 172006362| |Brazil | 2000| 80488| 174504898| |China | 1999| 212258| 1272915272| |China | 2000| 213766| 1280428583| ]<!----> .panel[.panel-name[Example 2] |country | year|type | count| |:-----------|----:|:----------|----------:| |Afghanistan | 1999|cases | 745| |Afghanistan | 1999|population | 19987071| |Afghanistan | 2000|cases | 2666| |Afghanistan | 2000|population | 20595360| |Brazil | 1999|cases | 37737| |Brazil | 1999|population | 172006362| |Brazil | 2000|cases | 80488| |Brazil | 2000|population | 174504898| |China | 1999|cases | 212258| |China | 1999|population | 1272915272| |China | 2000|cases | 213766| |China | 2000|population | 1280428583| ]<!----> .panel[.panel-name[Example 3] |country | year|rate | |:-----------|----:|:-----------------| |Afghanistan | 1999|745/19987071 | |Afghanistan | 2000|2666/20595360 | |Brazil | 1999|37737/172006362 | |Brazil | 2000|80488/174504898 | |China | 1999|212258/1272915272 | |China | 2000|213766/1280428583 | ]<!----> .panel[.panel-name[Example 4] .pull-left[ Table: Cases |country | 1999| 2000| |:-----------|------:|------:| |Afghanistan | 745| 2666| |Brazil | 37737| 80488| |China | 212258| 213766| ] .pull-right[ Table: Population |country | 1999| 2000| |:-----------|----------:|----------:| |Afghanistan | 19987071| 20595360| |Brazil | 172006362| 174504898| |China | 1272915272| 1280428583| ] ]<!----> .panel[.panel-name[Example 5] |country |century |year |rate | |:-----------|:-------|:----|:-----------------| |Afghanistan |19 |99 |745/19987071 | |Afghanistan |20 |00 |2666/20595360 | |Brazil |19 |99 |37737/172006362 | |Brazil |20 |00 |80488/174504898 | |China |19 |99 |212258/1272915272 | |China |20 |00 |213766/1280428583 | ]<!----> ]<!--end panelset--> --- .center[ # Tidy Data is Built for Analysis! ] .pull-left[ ```r table1 %>% mutate(rate = cases / population * 10000) %>% knitr::kable() ``` |country | year| cases| population| rate| |:-----------|----:|------:|----------:|--------:| |Afghanistan | 1999| 745| 19987071| 0.372741| |Afghanistan | 2000| 2666| 20595360| 1.294466| |Brazil | 1999| 37737| 172006362| 2.193931| |Brazil | 2000| 80488| 174504898| 4.612363| |China | 1999| 212258| 1272915272| 1.667495| |China | 2000| 213766| 1280428583| 1.669488| ] .pull-right[ ```r ggplot(table1, aes(year, cases)) + geom_line(aes(group = country), color = "grey50") + geom_point(aes(color = country, shape = country)) + scale_x_continuous(breaks = c(1999, 2000)) ``` <!-- --> ] --- class: inverse, center, middle # Pivoting! ---- <svg viewBox="0 0 581 512" style="position:relative;display:inline-block;top:.1em;fill:white;height:3em;" xmlns="http://www.w3.org/2000/svg"> <path d="M581 226.6C581 119.1 450.9 32 290.5 32S0 119.1 0 226.6C0 322.4 103.3 402 239.4 418.1V480h99.1v-61.5c24.3-2.7 47.6-7.4 69.4-13.9L448 480h112l-67.4-113.7c54.5-35.4 88.4-84.9 88.4-139.7zm-466.8 14.5c0-73.5 98.9-133 220.8-133s211.9 40.7 211.9 133c0 50.1-26.5 85-70.3 106.4-2.4-1.6-4.7-2.9-6.4-3.7-10.2-5.2-27.8-10.5-27.8-10.5s86.6-6.4 86.6-92.7-90.6-87.9-90.6-87.9h-199V361c-74.1-21.5-125.2-67.1-125.2-119.9zm225.1 38.3v-55.6c57.8 0 87.8-6.8 87.8 27.3 0 36.5-38.2 28.3-87.8 28.3zm-.9 72.5H365c10.8 0 18.9 11.7 24 19.2-16.1 1.9-33 2.8-50.6 2.9v-22.1z"></path></svg> --- .center[ # Pivoting ] - Most real data is untidy - 80% of data science is data wrangling, 20% is analysis <br> <br> - `tidyr` provides several helpful functions for getting data into a usable format for analysis <br> <br> - the most common data tidying task is pivoting <br> <br> - `tidyr` includes two function for pivoting - `pivot_longer`: makes a dataset **longer** by reducing columns and increasing rows - `pivot_wider`: makes a dataset **wider** by increasing columns and reducing rows --- .center[ # table4a dataset ] .pull-left[ .center[**Table 4a**] |country | 1999| 2000| |:-----------|------:|------:| |Afghanistan | 745| 2666| |Brazil | 37737| 80488| |China | 212258| 213766| ] .pull-right[ .center[**Table 1**] |country | year| cases| |:-----------|----:|------:| |Afghanistan | 1999| 745| |Afghanistan | 2000| 2666| |Brazil | 1999| 37737| |Brazil | 2000| 80488| |China | 1999| 212258| |China | 2000| 213766| ] --- .center[ # table4a dataset ] - problem: the column names are not names of variables, but *values* of a variable - `1999` and `2000` represent values of the `year` variable - the values in `1999` and `2000` represent values of the `case` variable <br> <br> - the first column is ok...it partially identifies the observation - `country` <br> <br> - we'd like the years (`1999` and `2000`) to all fall under a single `year` column and the cases to fall in a separate `cases` column - `pivot_longer()`! --- .center[ # pivot_longer() ] .panelset[ .panel[.panel-name[pivot_longer] - there are three key arguments to `pivot_longer` 1. `cols` specifies which columns need to be pivoted, i.e. which columns aren’t variables. - This argument uses the same syntax as `select()` so here we could use `!country)` or `c(1999, 2000)`. 2. `names_to` names of the variable stored in the column names, here "year". 3. `values_to` names the variable stored in the cell values, here "cases". ]<!----> .panel[.panel-name[`pivot_longer` in action!] .pull-left[ ```r table4a %>% pivot_longer( cols = c("1999", "2000"), names_to = "year", values_to = "cases" ) ``` ] .pull-right[ |country |year | cases| |:-----------|:----|------:| |Afghanistan |1999 | 745| |Afghanistan |2000 | 2666| |Brazil |1999 | 37737| |Brazil |2000 | 80488| |China |1999 | 212258| |China |2000 | 213766| ] ]<!----> ]<!--end panelset--> --- .center[ # How does pivot_longer work? ] .panelset[ .panel[.panel-name[Step 1] - columns that are already variables need to be repeated, once for each column that is pivotted .center[ <img src = "data:image/png;base64,#img/mod9/step1.png" /> .caption[ From [R for Data Science](https://r4ds.had.co.nz/tidy-data.html) ] ] ]<!----> .panel[.panel-name[Step 2] - column names of pivoted columns become a new column .center[ <img src = "data:image/png;base64,#img/mod9/step2.png" /> .caption[ From [R for Data Science](https://r4ds.had.co.nz/tidy-data.html) ] ] ]<!----> .panel[.panel-name[Step 3] - number of values is preserved (not repeated), but unwound row-by-row .center[ <img src = "data:image/png;base64,#img/mod9/step3.png" /> .caption[ From [R for Data Science](https://r4ds.had.co.nz/tidy-data.html) ] ] ]<!----> ]<!--end panelset--> --- .center[ # pivot_wider ] - used when one observation is spread across multiple rows - less common than `pivot_longer()` <br> <br> - opposite interface to `pivot_longer()` <br> <br> - need to provide the existing columns that define the values - `values_from` - and need to provide the existing columns that contain the columns names - `names_from` --- .center[ # The table2 Dataset ] .pull-left[ .center[**Table 2**] |country | year|type | count| |:-----------|----:|:----------|---------:| |Afghanistan | 1999|cases | 745| |Afghanistan | 1999|population | 19987071| |Afghanistan | 2000|cases | 2666| |Afghanistan | 2000|population | 20595360| |Brazil | 1999|cases | 37737| |Brazil | 1999|population | 172006362| ] .pull-right[ .center[**Table 1**] |country | year| cases| population| |:-----------|----:|-----:|----------:| |Afghanistan | 1999| 745| 19987071| |Afghanistan | 2000| 2666| 20595360| |Brazil | 1999| 37737| 172006362| ] --- .center[ # pivot_wider() ] ```r table2 %>% pivot_wider(values_from = "count", names_from = "type") ``` |country | year| cases| population| |:-----------|----:|-----:|----------:| |Afghanistan | 1999| 745| 19987071| |Afghanistan | 2000| 2666| 20595360| |Brazil | 1999| 37737| 172006362| --- .center[ # How does pivot_wider work? ] .center[ <img src = "data:image/png;base64,#img/mod9/wide.png" /> .caption[ From [R for Data Science](https://r4ds.had.co.nz/tidy-data.html) ] ] --- .center[ # Now You Try! ] - using the `billboard` dataset in `tidyr`, create the following dataset
--- .center[ # Solution ] .panelset[ .panel[.panel-name[Context] - the first three columns are variables that describe the song - `artist` - `track` - `date.entered` - the next 76 columns describe the rank of the song in each week - (`wk1`-`wk76`) - we'd like the weeks to all fall under a single `week` column and the ranks to fall in a separate `rank` column - `pivot_longer()`! - missing values have been removed ]<!----> .panel[.panel-name[Solution] .pull-left[ ```r billboard %>% pivot_longer( cols = starts_with("wk"), names_to = "week", values_to = "rank", values_drop_na = TRUE ) ``` ] .pull-right[ |artist |track |date.entered |week | rank| |:------|:-----------------------|:------------|:----|----:| |2 Pac |Baby Don't Cry (Keep... |2000-02-26 |wk1 | 87| |2 Pac |Baby Don't Cry (Keep... |2000-02-26 |wk2 | 82| |2 Pac |Baby Don't Cry (Keep... |2000-02-26 |wk3 | 72| |2 Pac |Baby Don't Cry (Keep... |2000-02-26 |wk4 | 77| |2 Pac |Baby Don't Cry (Keep... |2000-02-26 |wk5 | 87| ] ]<!----> ]<!--end panelset--> --- class: inverse, center, middle # Separating and Uniting! ---- <svg viewBox="0 0 581 512" style="position:relative;display:inline-block;top:.1em;fill:white;height:3em;" xmlns="http://www.w3.org/2000/svg"> <path d="M581 226.6C581 119.1 450.9 32 290.5 32S0 119.1 0 226.6C0 322.4 103.3 402 239.4 418.1V480h99.1v-61.5c24.3-2.7 47.6-7.4 69.4-13.9L448 480h112l-67.4-113.7c54.5-35.4 88.4-84.9 88.4-139.7zm-466.8 14.5c0-73.5 98.9-133 220.8-133s211.9 40.7 211.9 133c0 50.1-26.5 85-70.3 106.4-2.4-1.6-4.7-2.9-6.4-3.7-10.2-5.2-27.8-10.5-27.8-10.5s86.6-6.4 86.6-92.7-90.6-87.9-90.6-87.9h-199V361c-74.1-21.5-125.2-67.1-125.2-119.9zm225.1 38.3v-55.6c57.8 0 87.8-6.8 87.8 27.3 0 36.5-38.2 28.3-87.8 28.3zm-.9 72.5H365c10.8 0 18.9 11.7 24 19.2-16.1 1.9-33 2.8-50.6 2.9v-22.1z"></path></svg> --- .center[ # Separate ] .panelset[ .panel[.panel-name[Overview] - `separate` pulls apart one column into multple columns, by splitting wherever a separator character appears |country | year|rate | |:-----------|----:|:-----------------| |Afghanistan | 1999|745/19987071 | |Afghanistan | 2000|2666/20595360 | |Brazil | 1999|37737/172006362 | |Brazil | 2000|80488/174504898 | |China | 1999|212258/1272915272 | |China | 2000|213766/1280428583 | ]<!----> .panel[.panel-name[Separate in Action!] .pull-left[ ```r table3 %>% separate(rate, into = c("cases", "population"), sep = "/") ``` - key arguments 1. `data` - dataset to perform separation on 2. `col` - column to separate 3. `into` - names of resulting columns 4. `sep` - value that defines the separator between columns - will default to first non-alphanumeric value if not specified ] .pull-right[ ``` ## # A tibble: 6 × 4 ## country year cases population ## <chr> <int> <chr> <chr> ## 1 Afghanistan 1999 745 19987071 ## 2 Afghanistan 2000 2666 20595360 ## 3 Brazil 1999 37737 172006362 ## 4 Brazil 2000 80488 174504898 ## 5 China 1999 212258 1272915272 ## 6 China 2000 213766 1280428583 ``` ] ]<!----> .panel[.panel-name[How does it work?] .center[ <img src = "data:image/png;base64,#img/mod9/separate.png" /> .caption[ From [R for Data Science](https://r4ds.had.co.nz/tidy-data.html) ] ] ]<!----> .panel[.panel-name[Convert] .pull-left[ ```r table3 %>% separate(rate, into = c("cases", "population"), sep = "/", convert = TRUE) ``` - by default, `separate` will not change the type of the column - to allow `separate` to change resulting column to a more appropriate type, set `convert = TRUE` ] .pull-right[ ``` ## # A tibble: 6 × 4 ## country year cases population ## <chr> <int> <int> <int> ## 1 Afghanistan 1999 745 19987071 ## 2 Afghanistan 2000 2666 20595360 ## 3 Brazil 1999 37737 172006362 ## 4 Brazil 2000 80488 174504898 ## 5 China 1999 212258 1272915272 ## 6 China 2000 213766 1280428583 ``` ] ]<!----> ]<!--end panelset--> --- .center[ # Unite ] .panelset[ .panel[.panel-name[Overview] .pull-left[ - `unite` is the inverse of `separate` - combines multiple columns into a single column - used less frequently than `separate` ] .pull-right[ |country |century |year |rate | |:-----------|:-------|:----|:-----------------| |Afghanistan |19 |99 |745/19987071 | |Afghanistan |20 |00 |2666/20595360 | |Brazil |19 |99 |37737/172006362 | |Brazil |20 |00 |80488/174504898 | |China |19 |99 |212258/1272915272 | |China |20 |00 |213766/1280428583 | ] ]<!----> .panel[.panel-name[Unite in Action!] .pull-left[ ```r table5 %>% unite("new", century, year, sep = "") ``` - key arguments 1. `data` - dataset to perform separation on 2. `col` - name of resulting columns 3. `...` - names of columns to unite - uses `dplyr::select`-style syntax 4. `sep` - the character used to separate united column values in new column - defaults to "_" ] .pull-right[ ``` ## # A tibble: 6 × 3 ## country new rate ## <chr> <chr> <chr> ## 1 Afghanistan 1999 745/19987071 ## 2 Afghanistan 2000 2666/20595360 ## 3 Brazil 1999 37737/172006362 ## 4 Brazil 2000 80488/174504898 ## 5 China 1999 212258/1272915272 ## 6 China 2000 213766/1280428583 ``` ] ]<!----> .panel[.panel-name[How does it work?] .center[ <img src = "data:image/png;base64,#img/mod9/unite.png" /> .caption[ From [R for Data Science](https://r4ds.had.co.nz/tidy-data.html) ] ] ]<!----> ]<!--end panelset--> --- class: inverse, center, middle # Missing Data! ---- <svg viewBox="0 0 581 512" style="position:relative;display:inline-block;top:.1em;fill:white;height:3em;" xmlns="http://www.w3.org/2000/svg"> <path d="M581 226.6C581 119.1 450.9 32 290.5 32S0 119.1 0 226.6C0 322.4 103.3 402 239.4 418.1V480h99.1v-61.5c24.3-2.7 47.6-7.4 69.4-13.9L448 480h112l-67.4-113.7c54.5-35.4 88.4-84.9 88.4-139.7zm-466.8 14.5c0-73.5 98.9-133 220.8-133s211.9 40.7 211.9 133c0 50.1-26.5 85-70.3 106.4-2.4-1.6-4.7-2.9-6.4-3.7-10.2-5.2-27.8-10.5-27.8-10.5s86.6-6.4 86.6-92.7-90.6-87.9-90.6-87.9h-199V361c-74.1-21.5-125.2-67.1-125.2-119.9zm225.1 38.3v-55.6c57.8 0 87.8-6.8 87.8 27.3 0 36.5-38.2 28.3-87.8 28.3zm-.9 72.5H365c10.8 0 18.9 11.7 24 19.2-16.1 1.9-33 2.8-50.6 2.9v-22.1z"></path></svg> --- .center[ # Explicit vs. Implicit Missing Data ] - a value can be missing in one of two possible ways - explicltly: flagged as `NA` - implicitly: simply not present in the data .pull-left[ ```r stocks <- tibble( year = c(2015, 2015, 2015, 2016, 2016), qtr = c( 1, 2, 3, 2, 3), return = c(1.88, 0.59, NA, 0.92, 0.17) ) ``` ] .pull-right[ ``` ## # A tibble: 5 × 3 ## year qtr return ## <dbl> <dbl> <dbl> ## 1 2015 1 1.88 ## 2 2015 2 0.59 ## 3 2015 3 NA ## 4 2016 2 0.92 ## 5 2016 3 0.17 ``` ] --- .center[ # complete ] - `complete()` can be used to make missing values explicit in tidy data - `complete()` takes a set of columns and finds all unique combinations - ensures the original dataset contains all those values, filling in explicit `NA`s where necessary ```r complete(stocks, year, qtr) ``` ``` ## # A tibble: 6 × 3 ## year qtr return ## <dbl> <dbl> <dbl> ## 1 2015 1 1.88 ## 2 2015 2 0.59 ## 3 2015 3 NA ## 4 2016 1 NA ## 5 2016 2 0.92 ## 6 2016 3 0.17 ``` --- .center[ # fill ] - `fill()` can be used to fill in missing values with other values in the column - `fill()` takes a set of columns wher eyou want mising values to be repalced by the most recent nonmissing value - last observation carried forward .pull-left[ ```r treatment <- tribble( ~ person, ~ treatment, ~response, "Derrick Whitmore", 1, 7, NA, 2, 10, NA, 3, 9, "Katherine Burke", 1, 4 ) ``` ] .pull-right[ ``` ## # A tibble: 4 × 3 ## person treatment response ## <chr> <dbl> <dbl> ## 1 Derrick Whitmore 1 7 ## 2 <NA> 2 10 ## 3 <NA> 3 9 ## 4 Katherine Burke 1 4 ``` ] --- .center[ # fill in action ] .pull-left[ - by default, fill will fill in a downward direction ```r treatment %>% fill(person) ``` ``` ## # A tibble: 4 × 3 ## person treatment response ## <chr> <dbl> <dbl> ## 1 Derrick Whitmore 1 7 ## 2 Derrick Whitmore 2 10 ## 3 Derrick Whitmore 3 9 ## 4 Katherine Burke 1 4 ``` ] .pull-right[ - you can use the `.direction` argument to change the fill direction ```r treatment %>% fill(person, .direction = "up") ``` ``` ## # A tibble: 4 × 3 ## person treatment response ## <chr> <dbl> <dbl> ## 1 Derrick Whitmore 1 7 ## 2 Katherine Burke 2 10 ## 3 Katherine Burke 3 9 ## 4 Katherine Burke 1 4 ``` ]